Search results for "Correlation clustering"
showing 10 items of 28 documents
Quantum clustering in non-spherical data distributions: Finding a suitable number of clusters
2017
Quantum Clustering (QC) provides an alternative approach to clustering algorithms, several of which are based on geometric relationships between data points. Instead, QC makes use of quantum mechanics concepts to find structures (clusters) in data sets by finding the minima of a quantum potential. The starting point of QC is a Parzen estimator with a fixed length scale, which significantly affects the final cluster allocation. This dependence on an adjustable parameter is common to other methods. We propose a framework to find suitable values of the length parameter σ by optimising twin measures of cluster separation and consistency for a given cluster number. This is an extension of the Se…
CUDA-enabled hierarchical ward clustering of protein structures based on the nearest neighbour chain algorithm
2015
Clustering of molecular systems according to their three-dimensional structure is an important step in many bioinformatics workflows. In applications such as docking or structure prediction, many algorithms initially generate large numbers of candidate poses (or decoys), which are then clustered to allow for subsequent computationally expensive evaluations of reasonable representatives. Since the number of such candidates can easily range from thousands to millions, performing the clustering on standard central processing units (CPUs) is highly time consuming. In this paper, we analyse and evaluate different approaches to parallelize the nearest neighbour chain algorithm to perform hierarc…
Fast dendrogram-based OTU clustering using sequence embedding
2014
Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing.In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the …
SMART: Unique splitting-while-merging framework for gene clustering
2014
© 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named "splitting merging awareness tactics" (SMART), which does not require any a priori knowledge of either the number …
GenClust: A genetic algorithm for clustering gene expression data
2005
Abstract Background Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering. Results GenClust is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, …
Data Analysis and Bioinformatics
2007
Data analysis methods and techniques are revisited in the case of biological data sets. Particular emphasis is given to clustering and mining issues. Clustering is still a subject of active research in several fields such as statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data-sets with many attributes of different types. And this is a typical situation in biology. Some cases studies are also described.
Distance Functions, Clustering Algorithms and Microarray Data Analysis
2010
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
Structural clustering of millions of molecular graphs
2014
We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
The Three Steps of Clustering In The Post-Genomic Era
2013
This chapter descibes the basic algorithmic components that are involved in clustering, with particular attention to classification of microarray data.
A Greedy Algorithm for Hierarchical Complete Linkage Clustering
2014
We are interested in the greedy method to compute an hierarchical complete linkage clustering. There are two known methods for this problem, one having a running time of \({\mathcal O}(n^3)\) with a space requirement of \({\mathcal O}(n)\) and one having a running time of \({\mathcal O}(n^2 \log n)\) with a space requirement of Θ(n 2), where n is the number of points to be clustered. Both methods are not capable to handle large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) which is able to cluster one million points in a day on current commodity hardware.